Skills: Sampling and confidence intervals

Week 2

In general, the data sets you work with represent samples of larger populations. Your goal is to be able to generalize from your sample to the population.

Sometimes this is literally and obviously true. For example, you might have sent a survey to one percent of all Canadian households, and you want to analyze the data to make some claims about the characteristics of Canadian households in general.

Other times, describing your sample in terms of the population it comes from is not as obvious. For example, you might have data about every county in the United States suggesting that incomes are higher in urban counties than in rural counties. You might argue that this is not a sample, because you have data on every county that exists. However, if you want to claim that this difference is based on a fundamental difference that is likely to always exist between urban and rural counties in the United States, then you need to be able to generalize this difference to a broader population, which would be all hypothetical counties that could ever exist in the United States.

This week, you’ll continue to work with the data you generated last week. This dataset represents the full population you are studying.

You will be doing the following:

Data

We’ll start with a dataset of 10,000 observations like the one you generated for your assignment last week.

R

Here, I’ll load a csv file with the dataset into my R environment

library(here)
library(tidyverse)
library(knitr)

full_data <- here("week1",
     "full-data.csv") |>
  read_csv()

Here are the first few rows:

head(full_data) |>
  kable()
sq_feet dt_dist color pool blue green rent
1195.304 3.505617 Red 1 0 0 961.4318
1502.807 4.577164 Red 0 0 0 923.1070
1839.383 2.606373 Red 0 0 0 1257.2496
1405.328 3.343034 Blue 0 1 0 1016.5779
1390.682 4.262791 Red 0 0 0 860.3381
1318.790 4.128393 Red 0 0 0 816.7330

Excel

I’ll use a spreadsheet with a completed example of last week’s assignment as a starting point.

You can find the spreadsheet here (click the download button).

Here is the sheet with the dataset.

Adding random error to an outcome

Remember that we calculated the outcome variable directly from the predictor variables, so the outcome is perfectly predicted by those predictors. To make things a little more realistic, let’s add some random noise to the outcome.

R

In R, you can use the rnorm() function to add random variation to a variable. Here I’ll add a random value to the variable representing monthly rent. This extra value has an average of zero (so negative numbers will be as common as positive numbers) and a standard deviation of 100 (most of the values in a normal distribution fall with about two standard deviations of the mean, so this is like making the outcome equal to the calculated value, plus or minus 200).

full_data = full_data |>
  mutate(rent = rent + rnorm(10000, mean = 0, sd = 100))

Excel

In Excel, you can add some noise to a variable using the by generating a new random variable that varies between 0 and 1:

And then adding a term to the outcome formula that will be a normally-distributed variable with a mean of zero and a standard deviation of 200.

Drawing a sample

Let’s say this full dataset represents the full population I’m interested in, but I don’t have access to the full dataset. I only have access to a sample of 100 observations.

Here is one example of a sample I might have access to.

R

In R, I can draw a random sample of 100 observations from a larger dataframe using the function sample_n().

sample_1 <- full_data |>
  sample_n(100)

Excel

In Excel, I can take the following steps to take a random sample from a larger data set.

First, I add an index variable with row numbers.

Then, I create an index column for my sample data set. If I want to sample 100 observations, I generate 100 numbers from a random uniform distribution ranging between 1 and 10,000 (the size of the population I’m drawing from). I copy and paste the value of the variables (so they won’t keep recalculating new random values), and check to make sure there are no duplicate values (if there were, I’d need to generate a few more valued until I had 100 unique values).

Now, I can use a the VLOOKUP() function to get the values for the rows indicated by my random indices.

Calculating the sample average

We can take the average value of each variable in our sample, with the hope that this will tell us something about the average in the population it came from.

R

Here’s how you would generate a neat little table with the average value for each variable in your sample.

means <- sample_1 |>
  pivot_longer(cols = where(is.numeric),
               names_to = "Variable",
               values_to = "Value") |>
  group_by(Variable) |>
  summarise(Average = mean(Value))

kable(means)
Variable Average
blue 0.400000
dt_dist 4.304352
green 0.110000
pool 0.140000
rent 995.198130
sq_feet 1494.528730

Excel

And here’s how you would calculate the averages from the sample in Excel.

Guessing the population average (continuous variables)

You can use a one-sample t-test to make a reasonable guess of what the average value is for the population a sample comes from. A t-test will give you a range of values that the average of the full population might be within, at a given level of confidence.

R

Here is how you would extract the lower and upper values of a 95-percent confidence interval for average values in R.

means <- sample_1 |>
  select(dt_dist, rent, sq_feet) |>
  pivot_longer(cols = where(is.numeric),
               names_to = "Variable",
               values_to = "Value") |>
  group_by(Variable) |>
  summarise(Average = mean(Value),
            `C.I. Low` = t.test(Value, conf.level = 0.95)$conf.int[1],
            `C.I. High` = t.test(Value, conf.level = 0.95)$conf.int[2])

kable(means)
Variable Average C.I. Low C.I. High
dt_dist 4.304352 4.165545 4.443158
rent 995.198130 962.540279 1027.855981
sq_feet 1494.528730 1467.646248 1521.411211

Excel

To calculate the 95-percent confidence interval in Excel, you’d need to know that the confidence margin is the standard deviation times the z-score corresponding to the confidence level you want.

So, first you calculate the standard deviation of the sample.

Then you calculate the z-score for the desired confidence level using the function NORM.INV.S(). The probability you should use for a 95% confidence level is 0.975. Or, if you use a probability of 0.025, you’ll get the opposite (i.e. negative) of the z-score you need.

Now you can calculate the upper and lower bounds of the confidence interval.

Guessing the population proportions (categorical variables)

Confidence intervals for proportions work the same way as confidence intervals for averages, but you use the standard error of the proportion instead of the standard deviation.

R

In R, you use prop.test rather than t.test when you want to find the confidence interval for a proportion.

means <- sample_1 |>
  select(pool, blue, green) |>
  pivot_longer(cols = where(is.numeric),
               names_to = "Variable",
               values_to = "Value") |>
  group_by(Variable) |>
  summarise(Average = mean(Value),
            `C.I. Low` = prop.test(x = sum(Value==1),
                                   n = n(), 
                                   conf.level = 0.95)$conf.int[1],
            `C.I. High` = prop.test(x = sum(Value==1),
                                   n = n(), 
                                   conf.level = 0.95)$conf.int[2])

kable(means)
Variable Average C.I. Low C.I. High
blue 0.40 0.3047801 0.5029964
green 0.11 0.0588672 0.1922335
pool 0.14 0.0814039 0.2271163

Excel

To calculate the 95-percent confidence interval in Excel, you’d need to know that the confidence margin is the standard error times the z-score corresponding to the confidence level you want.

The standard error is given by the equation:

\[ S.E. = \sqrt{\frac{p\times(1-p)}{n}} \]

Where p is the proportion and n is the number of observations.

Then you calculate the z-score and the upper and lower bounds of the confidence interval as you would for a continuous variable.

Calculating regression coefficients

We can calculate regression coefficients for our predictor variables based in sample data, with the hope that this will inform our understanding of the way these variables relate to our outcome for the full population.

R

In R, you can estimate a linear regression mode using the lm() function.

model_sample_1 <- lm(rent ~ 
                       sq_feet +
                       dt_dist +
                       pool +
                       blue +
                       green,
                     data = sample_1)

Here is how you can view the model coefficients:

model_sample_1$coefficients
## (Intercept)     sq_feet     dt_dist        pool        blue       green 
##  58.6207468   0.7271854 -55.1680183 195.7650168 112.6074978 134.4571913

Excel

Here is how you would generate regression coefficients and their associated standard errors (a standard error is sort of like a standard deviation) in Excel.

First, select a block of cells with two rows and one more column than the number of predictor variables you want in your model.

With all those cells selected, type the formula =LINEST() with four arguments: the outcome variable, the set of predictor variables, TRUE (to indicate that you also need to estimate the constant), and TRUE (to indicate that you also need standard errors)

Once you’ve entered in the formula, hold down the control and shift keys (or the command and shift keys on a Mac) while you hit Enter (or Return on a Mac).

The cells you’ve selected will populate with values. The first row will be estimated coefficients and the second row will be standard errors.

The last column will be the constant. The rest of the columns will be for the variables in your set of predictors, in the reverse order of how they appear in your dataset.

Guessing the coefficients for the population

Confidence intervals for regression coefficients work the same way as confidence intervals for averages. They are a range of values that the value for the full population would probably fall within.

R

Here is how you can view the 95-percent confidence intervals for regression coefficients in R.

confint(model_sample_1, level = 0.95)
##                   2.5 %      97.5 %
## (Intercept) -203.487013 320.7285064
## sq_feet        0.584457   0.8699139
## dt_dist      -82.693499 -27.6425375
## pool         140.956601 250.5734326
## blue          71.743364 153.4716312
## green         71.175443 197.7389397

Excel

And in Excel, you’d generate the confidence intervals using a similar process to the one you used to get the confidence interval for the averages.